Cluster analysis
- Two types of unsupervised cluster analysis: partitioning and hierarchical.
- Partitioning algorithms separate objects (features) into a finite set of disjoint subsets, each of which has a center based on the average feature values within the cluster.
- Hierarchical algorithms order objects (features) into a hierarchically nested sequence, or tree structure, represented as a dendrogram containing a root, branches, and leaves.
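As a sketch of the partitioning idea, a minimal K-means in pure Python alternates between assigning each object to its nearest center and recomputing each center as the average feature values of its cluster. The `kmeans` helper, the toy data, and the fixed seed are all assumptions of this illustration, not part of the notes:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-means: partition points into k disjoint clusters,
    each summarized by the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: each center becomes the average feature value of its cluster.
        new_centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters

# Two well-separated groups in the plane.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, clusters = kmeans(data, k=2)
```

On well-separated data like this, the two recovered centers sit near the group averages, matching the description of a center as the mean of the feature values within its cluster.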
- Resemblance coefficients or distance metrics form a matrix of all pairwise comparisons of similarity or dissimilarity between objects.
- Correlation is a similarity coefficient.
- A computationally efficient variant of correlation is the Pitman correlation.
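Pearson correlation, the usual similarity coefficient, can be sketched directly from its definition (the Pitman variant itself is not shown here). The `pearson` helper and the toy profiles are illustrative assumptions:

```python
import math

def pearson(x, y):
    """Pearson correlation: a similarity coefficient in [-1, 1].
    1 - r is a common way to turn it into a dissimilarity for clustering."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [2.0, 4.0, 6.0, 8.0]   # same shape, different scale
print(pearson(profile_a, profile_b))  # prints 1.0
```

Because correlation ignores scale, the two profiles above are perfectly similar even though their raw values differ, which is why correlation-based similarity behaves differently from Euclidean or Manhattan distance.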
- Other names for the Manhattan and Euclidean distances are the $L_1$ and $L_2$ distances, respectively.
- Dot product: $\mathbf{x}_l'\mathbf{x}_m$
- Norm of a vector: $||\mathbf{x}|| = \sqrt{x_1^2 + \cdots + x_p^2}$
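Both distances above are special cases of the Minkowski $L_p$ distance, which a few lines of Python make concrete; the `minkowski` and `distance_matrix` helpers are assumptions of this sketch:

```python
def minkowski(x, y, p):
    """L_p distance between two feature vectors:
    p=1 gives Manhattan (L1), p=2 gives Euclidean (L2)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def distance_matrix(objects, p=2):
    """Matrix of all pairwise dissimilarities between objects,
    as used by both partitioning and hierarchical algorithms."""
    n = len(objects)
    return [[minkowski(objects[i], objects[j], p) for j in range(n)]
            for i in range(n)]

x = (3.0, 4.0)
y = (0.0, 0.0)
print(minkowski(x, y, 1))  # Manhattan (L1): prints 7.0
print(minkowski(x, y, 2))  # Euclidean (L2): prints 5.0
# The norm ||x|| is the Euclidean distance from x to the origin,
# i.e. minkowski(x, (0, 0), 2) here.
```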
Cluster validity:
- An ideal cluster structure reveals clusters whose centers are far apart and whose assigned objects are close together.
- The more disjoint the clusters and the less they overlap, the greater the chance that the specified number of clusters is the optimal choice.
- Silhouette index measures the degree of membership of objects within their assigned clusters.
- If all of the objects in a cluster have the same feature values, then the average intra-cluster distance $a(i) = 0$, resulting in a numerator of $b(i) - 0 = b(i)$, a denominator of $\max\{0, b(i)\} = b(i)$, and a ratio $s(i)$ of unity.
- If $a(i) > b(i)$, the numerator $b(i) - a(i)$ and hence $s(i)$ become negative, indicating that $x_i$ is misclassified.
- If there is no real cluster structure present, $a(i)$ and $b(i)$ will be similar, causing $s(i)$ to approach zero.
- By itself, $s(i)$ reflects only the cluster support for $x_i$, so the average silhouette index over all objects is used to measure overall clustering validity.
- Strong evidence of cluster structure occurs if $0.7 < s \le 1$, reasonable evidence when $0.5 < s \le 0.7$, and weak or no evidence when $s \le 0.5$.
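The silhouette computation above can be sketched directly from the definitions of $a(i)$, $b(i)$, and $s(i) = (b(i) - a(i)) / \max\{a(i), b(i)\}$; the helper names and the toy labeling are assumptions, and the sketch assumes every cluster has at least two members so $a(i)$ is defined:

```python
def silhouette(objects, labels, dist):
    """Average silhouette index: mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)),
    where a(i) is the mean distance from object i to its own cluster and
    b(i) the mean distance to the nearest other cluster."""
    scores = []
    clusters = set(labels)
    for i, x in enumerate(objects):
        # a(i): average intra-cluster distance (excluding x itself).
        own = [dist(x, y) for j, y in enumerate(objects)
               if j != i and labels[j] == labels[i]]
        a = sum(own) / len(own)
        # b(i): smallest average distance to any other cluster.
        b = min(
            sum(dist(x, y) for j, y in enumerate(objects) if labels[j] == c)
            / sum(1 for l in labels if l == c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def euclid(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]
print(silhouette(pts, labels, euclid))  # close to 1 for well-separated clusters
```

For this toy data the centers are far apart and the assigned objects close together, so the average silhouette lands in the "strong evidence" range.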
Gaussian mixture models
- Model-based clustering makes the assumption that a finite mixture of probability distributions is responsible for generating the data under consideration.
- Model-based cluster analysis can be viewed as an extension of K-means cluster analysis in which parameters are estimated by maximum likelihood.
- It evaluates and compares how well a selected mixture of probability density functions describes the cluster structure.
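A minimal sketch of the maximum-likelihood idea is expectation-maximization (EM) for a two-component one-dimensional Gaussian mixture; the `em_gmm_1d` helper, the crude initialization from the data range, and the toy data are all assumptions of this illustration:

```python
import math

def em_gmm_1d(data, iters=200):
    """EM for a 2-component 1-D Gaussian mixture: alternate soft assignments
    (E-step) with maximum-likelihood parameter updates (M-step)."""
    # Crude initialization from the data range (an assumption of this sketch).
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: weighted maximum-likelihood updates of weights, means, variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-6)  # guard against variance collapse
    return w, mu, var

data = [0.0, 0.2, -0.1, 0.1, 9.9, 10.1, 10.0, 9.8]
w, mu, var = em_gmm_1d(data)
```

With soft responsibilities replaced by hard nearest-center assignments and all variances fixed and equal, this reduces to K-means, which is the sense in which model-based clustering extends it.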